Sally: a tool for embedding strings in vector spaces

Authors

  • Konrad Rieck
  • Christian Wressnegger
  • Alexander Bikadorov
Abstract

Strings and sequences are ubiquitous in many areas of data analysis. However, only a few learning methods can be directly applied to this form of data. We present Sally, a tool for embedding strings in vector spaces that allows for applying a wide range of learning methods to string data. Sally implements a generalized form of the bag-of-words model, where strings are mapped to a vector space that is spanned by a set of string features, such as words or n-grams of words. The implementation of Sally builds on efficient string algorithms and enables processing millions of strings and features. The tool supports several data formats and is capable of interfacing with common learning environments, such as Weka, Shogun, Matlab, or Pylab. Sally has been successfully applied to learning with natural language text, DNA sequences and monitored program behavior.
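
The generalized bag-of-words mapping described in the abstract can be illustrated with a minimal Python sketch. This is not Sally's implementation or interface: the function names (`ngrams`, `embed`, `l2_normalize`, `dot`) and the `delim` parameter are hypothetical, and a plain `Counter` stands in for the efficient string algorithms the tool builds on. Each string becomes a sparse vector whose dimensions are n-grams of words or characters and whose values are occurrence counts.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def embed(text, n=2, delim=None):
    """Map a string to a sparse vector of n-gram counts.

    delim=None splits on whitespace (word n-grams);
    delim="" treats every character as a token (character n-grams).
    Both conventions are assumptions made for this sketch.
    """
    tokens = list(text) if delim == "" else text.split(delim)
    return Counter(ngrams(tokens, n))


def l2_normalize(vec):
    """Scale a sparse vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else dict(vec)


def dot(a, b):
    """Inner product of two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(k, 0.0) for k, v in a.items())


# Word 2-grams for natural language text
text_vec = embed("to be or not to be", n=2)

# Character 3-grams for a DNA-like sequence
dna_vec = embed("ACGTACGT", n=3, delim="")

# Cosine similarity between two embedded strings
similarity = dot(l2_normalize(embed("ACGTACGT", 3, "")),
                 l2_normalize(embed("ACGTTCGT", 3, "")))
```

Once strings are represented this way, any vector-based learning method can operate on them through inner products or distances, which is the point the abstract makes about interfacing with common learning environments.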

Similar resources

Kernel Dependency Estimation

We consider the learning problem of finding a dependency between a general class of objects and another, possibly different, general class of objects. The objects can be, for example, vectors, images, strings, trees or graphs. Such a task is made possible by employing similarity measures in both input and output spaces using kernel functions, thus embedding the objects into vector spaces. We exp...

On Generalized Injective Spaces in Generalized Topologies

In this paper, we first present a new type of the concept of open sets by expressing some properties of arbitrary mappings on a power set. With the generalization of the closure spaces in categorical topology, we introduce the generalized topological spaces and the concept of generalized continuity and become familiar with weak and strong structures for generalized topological spaces. Then, int...

Embedding measure spaces

For a given measure space $(X,\mathscr{B},\mu)$ we construct all measure spaces $(Y,\mathscr{C},\lambda)$ in which $(X,\mathscr{B},\mu)$ is embeddable. The construction is modeled on the ultrafilter construction of the Stone-Čech compactification of a completely regular topological space. Under certain conditions the construction simplifies. Examples are given when this simplification o...

Link Prediction using Network Embedding based on Global Similarity

Background: Link prediction is one of the most widely studied problems in complex network analysis. It requires knowing the history of previous link connections and combining it with the available information. Local link prediction approaches based on node structure are fast but not accurate enough. On the other hand, the global link predicti...

Embedding normed linear spaces into $C(X)$

It is well known that every (real or complex) normed linear space $L$ is isometrically embeddable into $C(X)$ for some compact Hausdorff space $X$. Here $X$ is the closed unit ball of $L^*$ (the set of all continuous scalar-valued linear mappings on $L$) endowed with the weak$^*$ topology, which is compact by the Banach-Alaoglu theorem. We prove that the compact Hausdorff space $X$ can ...

Journal:
  • Journal of Machine Learning Research

Volume 13

Publication date: 2012